Purpose of Study: (Note: This is the last paragraph of the research proposal used to remind the reader of the research objectives.)

The present study will examine young adults from the National Epidemiologic Survey of Alcohol and Related Conditions (NESARC). The goals of the analysis will include 1) establishing the relationship between major depression and nicotine dependence; and 2) determining whether or not the relationship between major depression and nicotine dependence exists above and beyond smoking quantity.

Variables:

Variables from NESARC that will be used include: MAJORDEPLIFE (has subject experienced a major depression?), CHECK321 (has subject smoked cigarettes in past 12 months?), AGE (age of subject), TAB12MDX (tobacco dependence past 12 months), S3AQ3B1 (usual frequency when cigarettes smoked), ETHRACE2A (ethnicity of subject), SEX (gender of subject), and S3AQ3C1 (usual smoking quantity of cigarettes when smoked). 

Data Management

First, the data is placed on the search path using the PDS package. Next, a subset of people 25 or younger who smoked in the last 12 months is created using the filter function from dplyr. Note that CHECK321 == 1 is used to see if subject has smoked in the past 12 months and !is.na() is used to make sure the subset does not contain NAs for the chosen variables. Last, the variables of interest are selected and stored in the data frame nesarc.sub using the select function from the dplyr package.

library(PDS)
library(dplyr)
nesarc.sub <- NESARC %>% 
  filter(!is.na(CHECK321) & !is.na(AGE) & CHECK321 ==1 & AGE <= 25)
dim(nesarc.sub)
[1] 1706 3008
nesarc.sub <- nesarc.sub %>% 
  select(MAJORDEPLIFE, CHECK321, AGE, TAB12MDX, S3AQ3B1, S3AQ3C1, ETHRACE2A, SEX)
dim(nesarc.sub)
[1] 1706    8
summary(nesarc.sub)
 MAJORDEPLIFE CHECK321      AGE        TAB12MDX S3AQ3B1     S3AQ3C1    
 0:1260       1:1706   Min.   :18.00   0:810    1:1320   Min.   : 1.0  
 1: 446       2:   0   1st Qu.:20.00   1:896    2:  68   1st Qu.: 5.0  
              9:   0   Median :22.00            3:  91   Median :10.0  
                       Mean   :21.61            4:  88   Mean   :11.8  
                       3rd Qu.:23.00            5:  65   3rd Qu.:20.0  
                       Max.   :25.00            6:  71   Max.   :99.0  
                                                9:   3                 
 ETHRACE2A SEX    
 1:1060    1:856  
 2: 211    2:850  
 3:  42           
 4:  58           
 5: 335           
                  
                  

Running summary on the nesarc.sub reveals some non-obvious categories for the factors CHECK321 and S3AQ3B1. Reviewing the Code Book (HINT: Inside a pdf use shift F (Windows) or command F (Mac) then type the variable name inside the box to find the variable in the pdf), it is noted that a 9 is used to indicate missing values for S3AQ3B1 and a 99 is used to indicate missing values for S3AQ3C1.

Coding missing values

The variable S3AQ3B1 uses a 9 to record unknown for smoking frequency and a 99 is used to record unknown for S3AQ3C1.

nesarc.sub$S3AQ3B1[nesarc.sub$S3AQ3B1 == 9] <- NA
summary(nesarc.sub$S3AQ3B1)  # Note that 9 still appears
   1    2    3    4    5    6    9 NA's 
1320   68   91   88   65   71    0    3 
nesarc.sub$S3AQ3B1 <- factor(nesarc.sub$S3AQ3B1)[, drop = TRUE]
summary(nesarc.sub$S3AQ3B1)  # Unused level no longer appears
   1    2    3    4    5    6 NA's 
1320   68   91   88   65   71    3 
nesarc.sub$S3AQ3C1[nesarc.sub$S3AQ3C1 == 99] <- NA
summary(nesarc.sub$S3AQ3C1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    5.00   10.00   11.34   20.00   98.00       9 
nesarc.sub$CHECK321[nesarc.sub$CHECK321 == 9] <- NA
nesarc.sub$CHECK321 <- factor(nesarc.sub$CHECK321)[ , drop = TRUE]
summary(nesarc.sub)
 MAJORDEPLIFE CHECK321      AGE        TAB12MDX S3AQ3B1    
 0:1260       1:1706   Min.   :18.00   0:810    1   :1320  
 1: 446                1st Qu.:20.00   1:896    2   :  68  
                       Median :22.00            3   :  91  
                       Mean   :21.61            4   :  88  
                       3rd Qu.:23.00            5   :  65  
                       Max.   :25.00            6   :  71  
                                                NA's:   3  
    S3AQ3C1      ETHRACE2A SEX    
 Min.   : 1.00   1:1060    1:856  
 1st Qu.: 5.00   2: 211    2:850  
 Median :10.00   3:  42           
 Mean   :11.34   4:  58           
 3rd Qu.:20.00   5: 335           
 Max.   :98.00                    
 NA's   :9                        

Creating New Variables

To estimate the total number of cigarettes a subject smokes per month, convert S3AQ3B1 (a factor with 6 levels) to a numeric variable using as.numeric. DaysSmoke estimates the days per month a subject smokes. The variable TotalCigsSmoked estimates the monthly number of cigarettes a subject smokes per month by multiplying DaysSmoke times S3AQ3C1 (the usual quantity smoked per day). The numeric variable TotalCigsSmoked is converted into a factor with roughly equivalent numbers stored in each level of the factor CigsSmokedFac using the cut function.

nesarc.sub$DaysSmoke <- as.numeric(nesarc.sub$S3AQ3B1)
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 1] <- 30
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 2] <- 4*5.5
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 3] <- 4*3.5
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 4] <- 4*1.5
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 5] <- 2.5
nesarc.sub$DaysSmoke[nesarc.sub$DaysSmoke == 6] <- 1
# Using dplyr again
nesarc.sub <- nesarc.sub %>% 
  mutate(TotalCigsSmoked = DaysSmoke*S3AQ3C1)
proportions <- quantile(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
proportions
  0%  25%  50%  75% 100% 
   1   90  300  600 2940 
nesarc.sub$CigsSmokedFac <- cut(nesarc.sub$TotalCigsSmoked, breaks = proportions, include.lowest = TRUE)
head(nesarc.sub)
  MAJORDEPLIFE CHECK321 AGE TAB12MDX S3AQ3B1 S3AQ3C1 ETHRACE2A SEX
1            0        1  25        1       1       3         2   2
2            1        1  21        1       2       3         5   1
3            1        1  24        1       1      10         1   2
4            1        1  23        1       1      10         1   2
5            0        1  21        0       1      20         1   1
6            1        1  23        0       1       5         1   2
  DaysSmoke TotalCigsSmoked CigsSmokedFac
1        30              90        [1,90]
2        22              66        [1,90]
3        30             300      (90,300]
4        30             300      (90,300]
5        30             600     (300,600]
6        30             150      (90,300]

Labeling Variables

In the next section of code, variables are labeled and levels of factors are given informative labels. The order of the levels is also rearranged for the variables S3AQ3B1, TAB12MDX, and SEX.

library(Hmisc)
label(nesarc.sub$TAB12MDX) <- "Tobacco Dependence past 12 Months"
label(nesarc.sub$CHECK321) <- "Smoked Cigarettes in the Past 12 Months"
label(nesarc.sub$S3AQ3B1) <- "Usual Smoking Frequency"
label(nesarc.sub$S3AQ3C1) <- "Usual Smoking Quantity"
nesarc.sub$MAJORDEPLIFE <- factor(nesarc.sub$MAJORDEPLIFE, 
                              labels = c("No Depression", "Yes Depression"))
nesarc.sub$S3AQ3B1 <- factor(nesarc.sub$S3AQ3B1, 
                         labels = c("Every Day", "5 to 6 Days/week", 
                                    "3 to 4 Days/week",   "1 to 2 Days/week", 
                                    "2 to 3 Days/month", 
                                    "Once a month or less"))
nesarc.sub$S3AQ3B1 <- factor(nesarc.sub$S3AQ3B1, 
                         levels = c("Once a month or less", "2 to 3 Days/month", 
                                    "1 to 2 Days/week",  "3 to 4 Days/week", 
                                    "5 to 6 Days/week", "Every Day"))
nesarc.sub$TAB12MDX <- factor(nesarc.sub$TAB12MDX, 
                         labels = c("No Nicotine Dependence", "Nicotine Dependence"))
nesarc.sub$TAB12MDX <- factor(nesarc.sub$TAB12MDX, 
                         levels = c("Nicotine Dependence", "No Nicotine Dependence"))
nesarc.sub$ETHRACE2A <- factor(nesarc.sub$ETHRACE2A, 
                               labels = c("Caucasian", "African American", 
                                          "Native American", "Asian", "Hispanic"))
nesarc.sub$SEX <- factor(nesarc.sub$SEX, labels = c("Male", "Female"))
table(nesarc.sub$SEX)

  Male Female 
   856    850 
nesarc.sub$SEX <- factor(nesarc.sub$SEX, levels = c("Female", "Male"))
summary(nesarc.sub)
         MAJORDEPLIFE  CHECK321      AGE       
 No Depression :1260   1:1706   Min.   :18.00  
 Yes Depression: 446            1st Qu.:20.00  
                                Median :22.00  
                                Mean   :21.61  
                                3rd Qu.:23.00  
                                Max.   :25.00  
                                               
                   TAB12MDX                   S3AQ3B1        S3AQ3C1     
 Nicotine Dependence   :896   Once a month or less:  71   Min.   : 1.00  
 No Nicotine Dependence:810   2 to 3 Days/month   :  65   1st Qu.: 5.00  
                              1 to 2 Days/week    :  88   Median :10.00  
                              3 to 4 Days/week    :  91   Mean   :11.34  
                              5 to 6 Days/week    :  68   3rd Qu.:20.00  
                              Every Day           :1320   Max.   :98.00  
                              NA's                :   3   NA's   :9      
            ETHRACE2A        SEX        DaysSmoke     TotalCigsSmoked 
 Caucasian       :1060   Female:850   Min.   : 1.00   Min.   :   1.0  
 African American: 211   Male  :856   1st Qu.:30.00   1st Qu.:  90.0  
 Native American :  42                Median :30.00   Median : 300.0  
 Asian           :  58                Mean   :25.07   Mean   : 319.4  
 Hispanic        : 335                3rd Qu.:30.00   3rd Qu.: 600.0  
                                      Max.   :30.00   Max.   :2940.0  
                                      NA's   :3       NA's   :9       
        CigsSmokedFac
 [1,90]        :429  
 (90,300]      :678  
 (300,600]     :503  
 (600,2.94e+03]: 87  
 NA's          :  9  
                     
                     

Renaming Variables

nesarc.sub <- nesarc.sub %>% 
  rename(Depression = MAJORDEPLIFE, TobaccoDependence = TAB12MDX, 
         SmokingFreq = S3AQ3B1, DailyCigsSmoked = S3AQ3C1, Ethnicity = ETHRACE2A, 
         Sex = SEX, Age = AGE)
head(nesarc.sub)
      Depression CHECK321 Age      TobaccoDependence      SmokingFreq
1  No Depression        1  25    Nicotine Dependence        Every Day
2 Yes Depression        1  21    Nicotine Dependence 5 to 6 Days/week
3 Yes Depression        1  24    Nicotine Dependence        Every Day
4 Yes Depression        1  23    Nicotine Dependence        Every Day
5  No Depression        1  21 No Nicotine Dependence        Every Day
6 Yes Depression        1  23 No Nicotine Dependence        Every Day
  DailyCigsSmoked        Ethnicity    Sex DaysSmoke TotalCigsSmoked
1               3 African American Female        30              90
2               3         Hispanic   Male        22              66
3              10        Caucasian Female        30             300
4              10        Caucasian Female        30             300
5              20        Caucasian   Male        30             600
6               5        Caucasian Female        30             150
  CigsSmokedFac
1        [1,90]
2        [1,90]
3      (90,300]
4      (90,300]
5     (300,600]
6      (90,300]

Creating Tables

Basic tables can be created using the functions table or xtabs. Frequency tables are created for the variables TobaccoDependence, CigsSmokedFac, and SmokingFreq.

T1 <- xtabs(~TobaccoDependence, data = nesarc.sub)
T2 <- xtabs(~CigsSmokedFac, data = nesarc.sub)
T3 <- xtabs(~SmokingFreq, data = nesarc.sub)
table(nesarc.sub$TobaccoDependence)

   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
T1
TobaccoDependence
   Nicotine Dependence No Nicotine Dependence 
                   896                    810 
T2
CigsSmokedFac
        [1,90]       (90,300]      (300,600] (600,2.94e+03] 
           429            678            503             87 
T3
SmokingFreq
Once a month or less    2 to 3 Days/month     1 to 2 Days/week 
                  71                   65                   88 
    3 to 4 Days/week     5 to 6 Days/week            Every Day 
                  91                   68                 1320 

In the data frame nesarc.sub, there are 896 nicotine dependent subjects and 810 subjects that are not nicotine dependent. A small number of smokers (87) smoke over 600 cigarettes per month. Most of the subjects in nesarc.sub are daily smokers (1320) with the remainder distributed uniformly across the first five levels of SmokingFreq.

Graphing Frequency Tables

The barplots are all created with the package ggplot2. The barplots start with the defaults for the geom_bar and add more detail to the plot with each graph.

library(ggplot2)
g1 <- ggplot(data = nesarc.sub, aes(x = TobaccoDependence)) + 
  geom_bar() 
g1
<<<<<<< HEAD <<<<<<< HEAD

g2 <- ggplot(data = na.omit(nesarc.sub), aes(x = CigsSmokedFac)) +
  geom_bar(fill = "brown")
g2

=======

g2 <- ggplot(data = na.omit(nesarc.sub), aes(x = CigsSmokedFac)) +
  geom_bar(fill = "brown")
g2

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

g2 <- ggplot(data = na.omit(nesarc.sub), aes(x = CigsSmokedFac)) +
  geom_bar(fill = "brown")
g2

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
g3 <- ggplot(data = na.omit(nesarc.sub), aes(SmokingFreq)) +
  geom_bar(fill = "gray") +
  labs(x = "", y = "Total Number", title = "Smoking Frequency for Young Adults") +
  theme_bw() + 
  theme(axis.text.x  = element_text(angle = 75, vjust = 0.5)) 
g3
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Graphing Numeric Variables

One popular way to graph a continuous variable is to use a histogram. R has the base function hist which can be used for this purpose. We will also use the package ggplot2 to create histograms. We start with a basic histogram of the variable TotalCigsSmoked.

hist(nesarc.sub$TotalCigsSmoked)
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Experiment with the code in the previous code chunk to change the color, the title, and if needed the labels on the \(x\)- and \(y\)-axes.

ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked)) +
  geom_histogram(binwidth = 200, fill = "gray") + 
  labs(x = "Estimated Cigarettes Smoked per Month") +
  theme_bw()
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Creating Density Plots

ggplot(data = nesarc.sub, aes(x = TotalCigsSmoked)) +
  geom_density(fill = "sienna") + 
  labs(x = "Estimated Cigarettes Smoked per Month") +
  theme_bw()
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked)) +
  geom_density(fill = "slategrey") + 
  labs(x = "Reported Cigarettes Smoked per Day") +
  theme_bw()
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

The 3 S’s

summary(nesarc.sub$TotalCigsSmoked)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0    90.0   300.0   319.4   600.0  2940.0       9 
fivenum(nesarc.sub$TotalCigsSmoked)
[1]    1   90  300  600 2940
median(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 300
IQR(nesarc.sub$TotalCigsSmoked, na.rm = TRUE)
[1] 510

The shape of the distribution for the estimated cigarettes smoked per month is skewed to the right. The center (median) of the distribution is 300 and the spread (interquartile range) for the distribution is 510.

Bivariate Graphs (C \(\rightarrow\) C)

Consider using depression status (Depression) to predict nicotine dependence (TobaccoDependence).

T1 <- xtabs(~ TobaccoDependence + Depression, data = nesarc.sub)
T1
                        Depression
TobaccoDependence        No Depression Yes Depression
  Nicotine Dependence              574            322
  No Nicotine Dependence           686            124
T2 <- prop.table(T1, 2)
T2
                        Depression
TobaccoDependence        No Depression Yes Depression
  Nicotine Dependence        0.4555556      0.7219731
  No Nicotine Dependence     0.5444444      0.2780269
barplot(T2)
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

The last graph needs a legend as well as a title. While it is possible to construct a legend and title for the last graph, it is much better to use an approach that will generate the legend automatically.

g1 <- ggplot(data = nesarc.sub, aes(x = Depression, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  theme_bw()
g1
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Tweaking the last graph.

g1 + labs(x = "", y = "", 
          title = "Fraction of young adult smokers with and without\n nicotine dependence by depression status") + 
  scale_fill_discrete(name="Tobacco Addiction\nStatus") 
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
g2 <- ggplot(data = na.omit(nesarc.sub), aes(x = SmokingFreq, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  theme_bw() + 
  theme(axis.text.x  = element_text(angle = 75, vjust = 0.5)) +
  labs(x = "")
g2
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
 g3 <- g2 + facet_grid(Sex ~ .) +
  theme(axis.text.x  = element_text(angle = 85, vjust = 0.5)) +
  labs(x = "", y = "", title = "Fraction of young adult smokers with and without\n nicotine dependence by smoking frequency") + 
  scale_fill_discrete(name="Tobacco Addiction\nStatus")
 g3
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
ggplot(data = na.omit(nesarc.sub), aes(x = CigsSmokedFac, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  theme_bw() + facet_grid(Sex ~ Depression) +
  theme(axis.text.x  = element_text(angle = 85, vjust = 0.5)) +
  labs(x = "Estimated Number of Cigarettes Smoked per Month", y = "", title = "Fraction of young adult smokers with and without\n nicotine dependence by smoking quantity") + 
  scale_fill_discrete(name="Tobacco Addiction\nStatus") 
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Using Mosaic Plots

library(vcd)
mosaic(~TobaccoDependence + Depression, data = nesarc.sub) 
<<<<<<< HEAD <<<<<<< HEAD

mosaic(~TobaccoDependence + Depression, data = nesarc.sub, shade = TRUE) 

mosaic(~TobaccoDependence + Depression + Sex, data = nesarc.sub, shade = TRUE)

mosaic(~TobaccoDependence + Depression + Sex + CigsSmokedFac, data = nesarc.sub, shade = TRUE)  

=======

=======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
mosaic(~TobaccoDependence + Depression, data = nesarc.sub, shade = TRUE) 

mosaic(~TobaccoDependence + Depression + Sex, data = nesarc.sub, shade = TRUE)

mosaic(~TobaccoDependence + Depression + Sex + CigsSmokedFac, data = nesarc.sub, shade = TRUE)  
<<<<<<< HEAD

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Some Statistics

Note nesarc.sub is not the same data set discussed in the book. In the book, the subset is smaller since it includes only people who have smoked every day (S3AQ41). This subset has \((n = 1706)\) compared to the one in the book which has \(n = 1320\).

ggplot(data = nesarc.sub, aes(x = DailyCigsSmoked)) + 
   geom_density(fill = "gray") + 
   facet_grid(Depression~.) +
   theme_bw()
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
tapply(nesarc.sub$DailyCigsSmoked, list(nesarc.sub$Depression), sd, na.rm = TRUE) 
 No Depression Yes Depression 
      8.545279       9.172743 
 ggplot(data = nesarc.sub, aes(x = Depression, y = DailyCigsSmoked, fill = Depression)) + geom_violin() + 
   theme_bw()
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
t.test(DailyCigsSmoked ~ Depression, data = nesarc.sub)

    Welch Two Sample t-test

data:  DailyCigsSmoked by Depression
t = -1.7158, df = 732.843, p-value = 0.08663
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.8312941  0.1231774
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    11.11891                     11.97297 
t.test(TotalCigsSmoked ~ Depression, data = nesarc.sub)

    Welch Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -1.8353, df = 734.092, p-value = 0.06686
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -59.652230   2.007771
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    311.8771                     340.6993 
t.test(TotalCigsSmoked ~ Depression, data = nesarc.sub, var.equal = TRUE) # assumptions not really met here

    Two Sample t-test

data:  TotalCigsSmoked by Depression
t = -1.8965, df = 1695, p-value = 0.05807
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -58.6306829   0.9862242
sample estimates:
 mean in group No Depression mean in group Yes Depression 
                    311.8771                     340.6993 
summary(aov(TotalCigsSmoked ~ Depression, data = nesarc.sub))
              Df    Sum Sq Mean Sq F value Pr(>F)  
Depression     1    272337  272337   3.597 0.0581 .
Residuals   1695 128346537   75721                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9 observations deleted due to missingness
# OR
mod <- lm(TotalCigsSmoked ~ Depression, data = nesarc.sub)
anova(mod)
Analysis of Variance Table

Response: TotalCigsSmoked
             Df    Sum Sq Mean Sq F value  Pr(>F)  
Depression    1    272337  272337  3.5966 0.05807 .
Residuals  1695 128346537   75721                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Taking a permuation approach

You should add your own notes here when I explain this in class.

set.seed(123)
SIMS <- 1000 - 1
stat <- numeric(SIMS)
TS <- t.test(DailyCigsSmoked ~ Depression, data = nesarc.sub)$stat
for(i in 1:SIMS){
  stat[i] <- t.test(DailyCigsSmoked ~ sample(Depression), data = nesarc.sub)$stat
}
ggplot(data = data.frame(stat = stat), aes(x = stat)) + 
  geom_density(fill = "pink") + 
  theme_bw() + 
  stat_function(fun = dt, arg = list(df = 733), color = "red", linetype = "dashed") + geom_vline(x = c(TS, -TS), linetype = "dashed", color = "blue")
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
TS
        t 
-1.715751 
pv <- (sum(stat >= abs(TS))*2 + 1) / SIMS
pv
[1] 0.08308308
# ANOVA
set.seed(123)
SIMS <- 1000 - 1
stat <- numeric(SIMS)
TSF <- anova(lm(DailyCigsSmoked ~ Depression, data = nesarc.sub))[1, 4]
for(i in 1:SIMS){
  stat[i] <- anova(lm(DailyCigsSmoked ~ sample(Depression), data = nesarc.sub))[1, 4]
}
ggplot(data = data.frame(stat = stat), aes(x = stat)) + 
  geom_density(fill = "pink") + 
  theme_bw() + 
  stat_function(fun = df, arg = list(df1 = 1, df2 = 1695), color = "red", linetype = "dashed") + geom_vline(x = TSF, linetype = "dashed", color = "blue") + ylim(0, 1)
<<<<<<< HEAD <<<<<<< HEAD

=======

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
TSF
[1] 3.149407
pvF <- (sum(stat >= TSF) + 1) / SIMS
pvF
<<<<<<< HEAD
[1] 0.07370737
<<<<<<< HEAD ======= =======
[1] 0.07007007
>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed

Another Example

ggplot(data = nesarc.sub, aes(x = Ethnicity, y = TotalCigsSmoked, fill = Ethnicity)) + 
  geom_boxplot() +
  guides(fill = FALSE) +
  theme_bw() + 
  labs(x = "", y = "Estimated Number of Cigarettes Smoked per Month\n")

ggplot(data = nesarc.sub, aes(x = Ethnicity, y = TotalCigsSmoked, fill = Ethnicity)) + 
  geom_violin() +
  guides(fill = FALSE) +
  theme_bw() + 
  labs(x = "", y = "Estimated Number of Cigarettes Smoked per Month\n")

\(H_0:\mu_{\text{Caucassian}} = \mu_{\text{African American}}=\mu_{\text{Native American}} =\mu_{\text{Asian}} =\mu_{\text{Hispanic}}\) versus \(H_1:\mu_{i} \neq \mu_{j}\) for some \((i, j)\).

eth.aov <- aov(TotalCigsSmoked ~ Ethnicity, data = nesarc.sub)
summary(eth.aov)
              Df    Sum Sq Mean Sq F value Pr(>F)    
Ethnicity      4   7012102 1753026   24.39 <2e-16 ***
Residuals   1692 121606772   71872                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9 observations deleted due to missingness
CIs <- TukeyHSD(eth.aov)
CIs
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = TotalCigsSmoked ~ Ethnicity, data = nesarc.sub)

$Ethnicity
                                       diff        lwr        upr
African American-Caucasian       -110.06777 -165.38997  -54.74558
Native American-Caucasian         -58.90587 -174.09624   56.28450
Asian-Caucasian                  -123.96375 -222.69939  -25.22811
Hispanic-Caucasian               -149.57543 -195.59596 -103.55490
Native American-African American   51.16190  -72.58143  174.90524
Asian-African American            -13.89598 -122.48842   94.69646
Hispanic-African American         -39.50766 -104.01720   25.00189
Asian-Native American             -65.05788 -213.38385   83.26809
Hispanic-Native American          -90.66956 -210.54348   29.20436
Hispanic-Asian                    -25.61168 -129.77339   78.55003
                                     p adj
African American-Caucasian       0.0000006
Native American-Caucasian        0.6300505
Asian-Caucasian                  0.0056089
Hispanic-Caucasian               0.0000000
Native American-African American 0.7911826
Asian-African American           0.9968050
Hispanic-African American        0.4514196
Asian-Native American            0.7527322
Hispanic-Native American         0.2356840
Hispanic-Asian                   0.9625311
opar <- par(no.readonly = TRUE)  # copy of current settings
par(mar =c(1, 15, 2, 1)) 
plot(CIs, las = 1)

par(opar)                        # restore original settings

Evidence (p = 1.190459710^{-19}) suggests a strong relationship between ethnicity and the estimated number of cigarettes an individual smokes per month. Post hoc comparisons suggest mean differences in the number of cigarettes smoked per month for African American and Caucasian, Asian and Caucasian, and Hispanic and Caucasian.

Chi Square Tests

library(vcd)
mosaic(~TobaccoDependence + Depression, data = nesarc.sub, shade = TRUE)

xtabs(~TobaccoDependence + Depression, data = nesarc.sub)
                        Depression
TobaccoDependence        No Depression Yes Depression
  Nicotine Dependence              574            322
  No Nicotine Dependence           686            124
T1 <- prop.table(xtabs(~TobaccoDependence + Depression, data = nesarc.sub), 1)
T1
                        Depression
TobaccoDependence        No Depression Yes Depression
  Nicotine Dependence        0.6406250      0.3593750
  No Nicotine Dependence     0.8469136      0.1530864
CR <- chisq.test(nesarc.sub$TobaccoDependence, nesarc.sub$Depression)
CR

    Pearson's Chi-squared test with Yates' continuity correction

data:  nesarc.sub$TobaccoDependence and nesarc.sub$Depression
X-squared = 92.6945, df = 1, p-value < 2.2e-16
<<<<<<< HEAD

A Chi Square test of independence suggest that among young adult smokers, those with past year nicotine dependence were more likely to have experiences major depression in thier lifetime (35.9375%) compared to those withou past year nicotine dependence (15.308642%), \(\chi^2 = 92.6945346\), 1 df, p < 0.0001.

>>>>>>> dc4cb233a02691157d230500c5b52ee09dd78410 =======

A Chi Square test of independence suggest that among young adult smokers, those with past year nicotine dependence were more likely to have experiences major depression in their lifetime (35.9375%) compared to those without past year nicotine dependence (15.308642%), \(\chi^2 = 92.6945346\), 1 df, p < 0.0001.

mosaic(~TobaccoDependence + SmokingFreq, data = nesarc.sub, shade = TRUE)

TA <- xtabs(~TobaccoDependence + SmokingFreq, data = nesarc.sub)
TA
                        SmokingFreq
TobaccoDependence        Once a month or less 2 to 3 Days/month
  Nicotine Dependence                       7                12
  No Nicotine Dependence                   64                53
                        SmokingFreq
TobaccoDependence        1 to 2 Days/week 3 to 4 Days/week
  Nicotine Dependence                  19               32
  No Nicotine Dependence               69               59
                        SmokingFreq
TobaccoDependence        5 to 6 Days/week Every Day
  Nicotine Dependence                  27       799
  No Nicotine Dependence               41       521
T1 <- prop.table(xtabs(~TobaccoDependence + SmokingFreq, data = nesarc.sub), 1)
T1
                        SmokingFreq
TobaccoDependence        Once a month or less 2 to 3 Days/month
  Nicotine Dependence              0.00781250        0.01339286
  No Nicotine Dependence           0.07930607        0.06567534
                        SmokingFreq
TobaccoDependence        1 to 2 Days/week 3 to 4 Days/week
  Nicotine Dependence          0.02120536       0.03571429
  No Nicotine Dependence       0.08550186       0.07311029
                        SmokingFreq
TobaccoDependence        5 to 6 Days/week  Every Day
  Nicotine Dependence          0.03013393 0.89174107
  No Nicotine Dependence       0.05080545 0.64560099
CR <- chisq.test(nesarc.sub$TobaccoDependence, nesarc.sub$SmokingFreq)
CR

    Pearson's Chi-squared test

data:  nesarc.sub$TobaccoDependence and nesarc.sub$SmokingFreq
X-squared = 165.2732, df = 5, p-value < 2.2e-16

The Chi Square test of independence suggest there is an association between smoking frequency and tobacco dependence. To determine if there are significant differences among the six levels of smoking frequency requires one to examine \(\binom{6}{2} =\) 15 possible differences with an appropriate \(\alpha\) value to account for the multiple comparisons. Use the Bonferroni approach where $= 0.05/(15) = 0.0033333 to control the experiment-wise error rate.

chisq.test(TA[, c(1, 2)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(1, 2)]
X-squared = 1.4349, df = 1, p-value = 0.231
chisq.test(TA[, c(1, 3)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(1, 3)]
X-squared = 3.1428, df = 1, p-value = 0.07626
chisq.test(TA[, c(1, 4)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(1, 4)]
X-squared = 12.6226, df = 1, p-value = 0.0003811
chisq.test(TA[, c(1, 5)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(1, 5)]
X-squared = 15.1695, df = 1, p-value = 9.828e-05
chisq.test(TA[, c(1, 6)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(1, 6)]
X-squared = 68.9247, df = 1, p-value < 2.2e-16
chisq.test(TA[, c(2, 3)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(2, 3)]
X-squared = 0.0743, df = 1, p-value = 0.7852
chisq.test(TA[, c(2, 4)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(2, 4)]
X-squared = 4.4318, df = 1, p-value = 0.03527
chisq.test(TA[, c(2, 5)]) # not significant 

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(2, 5)]
X-squared = 6.2484, df = 1, p-value = 0.01243
chisq.test(TA[, c(2, 6)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(2, 6)]
X-squared = 43.4608, df = 1, p-value = 4.325e-11
chisq.test(TA[, c(3, 4)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(3, 4)]
X-squared = 3.407, df = 1, p-value = 0.06492
chisq.test(TA[, c(3, 5)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(3, 5)]
X-squared = 5.2141, df = 1, p-value = 0.0224
chisq.test(TA[, c(3, 6)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(3, 6)]
X-squared = 49.7975, df = 1, p-value = 1.705e-12
chisq.test(TA[, c(4, 5)]) # not significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(4, 5)]
X-squared = 0.1768, df = 1, p-value = 0.6741
chisq.test(TA[, c(4, 6)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(4, 6)]
X-squared = 21.5899, df = 1, p-value = 3.376e-06
chisq.test(TA[, c(5, 6)]) # significant

    Pearson's Chi-squared test with Yates' continuity correction

data:  TA[, c(5, 6)]
X-squared = 10.7904, df = 1, p-value = 0.00102

There are significant differences between the percent of subjects that have nicotine dependence while smoking once a month or less and 3 to 4 Days/week, 5 to 6 Days/week, and Every Day. There are significant differences between percent of subjects that have nicotine dependence while smoking Every day and each of the less frequent smoking modes.

>>>>>>> 98e3b31f6d8c707564774e732a79d3a3f8ac22ed
library(moonBook)
out <- mytable(Depression + TobaccoDependence ~  Age + SmokingFreq + DailyCigsSmoked + Ethnicity + Sex +DaysSmoke + TotalCigsSmoked + CigsSmokedFac , data = nesarc.sub)
myhtml(out)
Descriptive Statistics stratified by Depression and TobaccoDependence
Depression No Depression Yes Depression
TobaccoDependence Nicotine Dependence
(N=574)
No Nicotine Dependence
(N=686)
p Nicotine Dependence
(N=322)
No Nicotine Dependence
(N=124)
p
Age 21.5 ± 2.3 21.7 ± 2.1 0.145 21.5 ± 2.2 21.8 ± 2.2 0.205
SmokingFreq 0.000 0.000
    Once a month or less 6 ( 1.0%) 43 ( 6.3%) 1 ( 0.3%) 21 (16.9%)
    2 to 3 Days/month 8 ( 1.4%) 41 ( 6.0%) 4 ( 1.2%) 12 ( 9.7%)
    1 to 2 Days/week 12 ( 2.1%) 60 ( 8.8%) 7 ( 2.2%) 9 ( 7.3%)
    3 to 4 Days/week 22 ( 3.8%) 50 ( 7.3%) 10 ( 3.1%) 9 ( 7.3%)
    5 to 6 Days/week 16 ( 2.8%) 34 ( 5.0%) 11 ( 3.4%) 7 ( 5.6%)
    Every Day 510 (88.9%) 455 (66.6%) 289 (89.8%) 66 (53.2%)
DailyCigsSmoked 13.6 ± 9.0 9.0 ± 7.5 0.000 13.8 ± 9.5 7.3 ± 6.1 0.000
Ethnicity 0.000 0.053
    Caucasian 377 (65.7%) 402 (58.6%) 214 (66.5%) 67 (54.0%)
    African American 77 (13.4%) 86 (12.5%) 27 ( 8.4%) 21 (16.9%)
    Native American 12 ( 2.1%) 9 ( 1.3%) 15 ( 4.7%) 6 ( 4.8%)
    Asian 25 ( 4.4%) 20 ( 2.9%) 10 ( 3.1%) 3 ( 2.4%)
    Hispanic 83 (14.5%) 169 (24.6%) 56 (17.4%) 27 (21.8%)
Sex 0.633 0.031
    Female 263 (45.8%) 304 (44.3%) 194 (60.2%) 89 (71.8%)
    Male 311 (54.2%) 382 (55.7%) 128 (39.8%) 35 (28.2%)
DaysSmoke 0.000 0.000
    1 18 ( 3.1%) 103 (15.1%) 8 ( 2.5%) 30 (24.2%)
    2.5 8 ( 1.4%) 41 ( 6.0%) 4 ( 1.2%) 12 ( 9.7%)
    14 22 ( 3.8%) 50 ( 7.3%) 10 ( 3.1%) 9 ( 7.3%)
    22 16 ( 2.8%) 34 ( 5.0%) 11 ( 3.4%) 7 ( 5.6%)
    30 510 (88.9%) 455 (66.6%) 289 (89.8%) 66 (53.2%)
TotalCigsSmoked 394.9 ± 280.5 241.7 ± 239.6 0.000 403.9 ± 294.2 175.7 ± 197.0 0.000
CigsSmokedFac 0.000 0.000
    [1,90] 69 (12.0%) 258 (38.0%) 40 (12.5%) 62 (50.4%)
    (90,300] 241 (42.0%) 262 (38.6%) 132 (41.1%) 43 (35.0%)
    (300,600] 224 (39.0%) 139 (20.5%) 122 (38.0%) 18 (14.6%)
    (600,2.94e+03] 40 ( 7.0%) 20 ( 2.9%) 27 ( 8.4%) 0 ( 0.0%)